Testing and summarizing the relationship between 2 variables (correlation)
Pearson’s 𝒓 analysis (parametric)
Spearman test (non-parametric)
Measures of association
Chi-Square test of independence
Fisher’s Exact Test (alternative to the Chi-Square Test of Independence)
From correlation/association to prediction/causation
The purpose of observational and experimental studies
Widely used analytical tools
Simple linear regression models
Multiple Linear Regression models
Shifting the emphasis to empirical prediction
Introduction to Machine Learning (ML)
Distinction between Supervised & Unsupervised algorithms
R ENVIRONMENT SET UP & DATA
Needed R Packages
We will use functions from packages base, utils, and stats (pre-installed and pre-loaded)
We will also use the packages below (specifying package::function for clarity).
# Load them for this R session
# General
library(fs)          # file/directory interactions
library(here)        # tools to find your project's files, based on working directory
library(paint)       # paint data.frames summaries in colour
library(janitor)     # tools for examining and cleaning data
library(dplyr)       # {tidyverse} tools for manipulating and summarizing tidy data
library(forcats)     # {tidyverse} tool for handling factors
library(openxlsx)    # Read, Write and Edit xlsx Files
library(flextable)   # Functions for Tabular Reporting
# Statistics
library(rstatix)     # Pipe-Friendly Framework for Basic Statistical Tests
library(lmtest)      # Testing Linear Regression Models
library(broom)       # Convert Statistical Objects into Tidy Tibbles
library(tidymodels)  # Easily Install and Load the 'Tidymodels' Packages
library(performance) # Assessment of Regression Models Performance
# Plotting
library(ggplot2)     # Create Elegant Data Visualisations Using the Grammar of Graphics
DATASETS for today
We will use examples (with adapted datasets) from real clinical studies, provided among the learning materials of the open access books:
Name: NHANES (National Health and Nutrition Examination Survey), which combines interviews and physical examinations to assess the health and nutritional status of adults and children in the United States. Started in the 1960s, it became a continuous program in 1999.
Documentation: dataset1
Sampling details: Here we use a sample of 500 adults from NHANES 2009-2010 & 2011-2012 (nhanes.samp.adult.500 in the R oibiostat package, which has been adjusted so that it can be viewed as a random sample of the US population).
Adapt the call to here::here() to match your own folder structure
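As a minimal sketch of what that adaptation might look like: the folder name "data" and the file name "nhanes_sample.xlsx" below are hypothetical placeholders, not the workshop's actual paths, and the snippet assumes the here package is installed.

```r
# Hypothetical folder and file names: replace them with your own project layout
data_path <- here::here("data", "nhanes_sample.xlsx")

# here::here() builds an absolute path starting from the project root,
# so the same code works regardless of the current working directory
print(data_path)

# Once the file exists at that location, it could be read with, for example:
# nhanes <- openxlsx::read.xlsx(data_path)
```

Each argument to here::here() is one level of the path, so a deeper folder structure only needs more arguments.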
We can start looking at how the model performs by applying it to our nhanes_test sub-sample, using the function stats::predict
Linear regression performance: predicted values in test sample
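The step of predicting on a held-out test sample can be sketched as follows. This is only an illustration under stated assumptions: it uses the built-in mtcars data as a stand-in for the NHANES sample, and the object names (train, test, fit, pred) and the model mpg ~ wt are hypothetical, not the workshop's own.

```r
# Stand-in data: mtcars (built into R), NOT the NHANES sample used in the workshop
set.seed(123)                                  # make the random split reproducible
n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
train <- mtcars[train_idx, ]                   # hypothetical training sub-sample
test  <- mtcars[-train_idx, ]                  # hypothetical test sub-sample

# Fit the linear model on the training data only
fit <- stats::lm(mpg ~ wt, data = train)

# Predict the outcome for the unseen test observations
pred <- stats::predict(fit, newdata = test)

# Put observed and predicted values side by side for a first look
print(head(data.frame(observed = test$mpg, predicted = pred)))
```

The key point is that newdata receives rows the model has not seen during fitting, so the comparison reflects out-of-sample performance.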
We can look at the 95% CI of all the predicted values
We can look at the 95% CI of a single predicted value
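One way to obtain these intervals is the interval argument of stats::predict. The sketch below again assumes mtcars as stand-in data with a hypothetical model and a simple fixed split; it is not the workshop's NHANES model.

```r
# Stand-in data and a simple fixed split (hypothetical, for illustration only)
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]
fit   <- stats::lm(mpg ~ wt, data = train)

# 95% confidence intervals for every predicted value in the test sample;
# the result is a matrix with columns fit (prediction), lwr and upr (bounds)
ci_all <- stats::predict(fit, newdata = test,
                         interval = "confidence", level = 0.95)
print(head(ci_all))

# 95% confidence interval for a single predicted value (first test row)
ci_one <- stats::predict(fit, newdata = test[1, , drop = FALSE],
                         interval = "confidence", level = 0.95)
print(ci_one)
```

Note that interval = "confidence" describes uncertainty about the mean response at those predictor values; interval = "prediction" would instead give the wider interval for an individual new observation.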
Linear regression performance: RMSE
Basically we are asking: “how does the prediction compare to the actual test dataset?”
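For this we take the difference between each predicted value and the corresponding actual value, square the differences, average them, and take the square root:
RMSE (Root Mean Squared Error) = sqrt(mean((predicted - actual)^2))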
This is quite close to the residual standard error reported in the regression model summary (6.843), even though that value was computed on the training data and the RMSE comes from the testing data
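The comparison between test RMSE and training residual standard error can be sketched in base R. As before, mtcars and the model mpg ~ wt are stand-ins chosen for illustration, so the numbers produced will differ from the 6.843 reported for the workshop's model.

```r
# Stand-in data and a simple fixed split (hypothetical, for illustration only)
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]
fit   <- stats::lm(mpg ~ wt, data = train)

# Predictions on the test sample
pred <- stats::predict(fit, newdata = test)

# RMSE: square root of the mean squared difference between predicted and actual
rmse_test <- sqrt(mean((pred - test$mpg)^2))

# Residual standard error of the fitted model (computed on the training data)
rse_train <- stats::sigma(fit)

print(c(rmse_test = rmse_test, rse_train = rse_train))
```

The two quantities are on the same scale as the outcome variable, which is what makes them directly comparable; a test RMSE far above the training residual standard error would suggest the model generalizes poorly.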